I want to create a simulation study to compare Bayesian, frequentist, and consensus inference for model-based clustering. I am interested in evaluating the performance of each method in uncovering the true clustering, its robustness, and its computational cost.
I refer to the objects being clustered as items and the variables being used to cluster them as features. Following the language of Law, Jain, and Figueiredo (2003), I call a feature relevant if it provides component-specific information and thus contributes to the clustering, and irrelevant if it does not.
Our data-generating model is a finite mixture model with \(P_s\) relevant and \(P_n\) irrelevant features, all features independent:
\[ \begin{aligned} p(x, z, \theta, \pi) &= \prod_{i=1}^N p(x_i \mid z_i, \theta_{z_i}) \prod_{i=1}^N p(z_i \mid \pi) \, p(\pi) \, p(\theta) \\ &= \prod_{i=1}^N \prod_{p=1}^P p(x_{ip} \mid z_i, \theta_{z_i p})^{(1 - \phi_p)} \, p(x_{ip} \mid \theta_p)^{\phi_p} \prod_{i=1}^N p(z_i \mid \pi) \, p(\pi) \, p(\theta) \end{aligned} \]
Here \(\phi_p \in \{0, 1\}\) indicates that feature \(p\) is irrelevant, in which case \(x_{ip}\) is drawn independently of the component label \(z_i\).
During my simulations I assume that the model in question is a mixture of Gaussians and thus \(\theta_{kp}=(\mu_{kp}, \sigma^2_{kp})\).
As each method of inference uses a common model, this data-generating mechanism is not expected to favour any one method over another.
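As a concrete sketch of this data-generating mechanism, a `generateSimulationDataset` along the following lines would cover the Gaussian case with irrelevant features. This is a hypothetical version for illustration: the argument order and the placement of component means are my assumptions, not the real implementation.

```r
# Hypothetical sketch of the generator; the real generateSimulationDataset
# may differ in argument order and in how component means are placed.
generateSimulationDataset <- function(N, P_s, P_n, K, delta_mu, sigma2,
                                      pi = rep(1 / K, K)) {
  # Component labels drawn from the mixture weights
  z <- sample(seq_len(K), N, replace = TRUE, prob = pi)
  # Relevant features: component k has mean (k - 1) * delta_mu in each feature
  relevant <- sapply(seq_len(P_s), function(p) {
    rnorm(N, mean = (z - 1) * delta_mu, sd = sqrt(sigma2))
  })
  # Irrelevant features: a single Gaussian, independent of the labels
  irrelevant <- matrix(rnorm(N * P_n, 0, sqrt(sigma2)), nrow = N)
  list(data = cbind(relevant, irrelevant), cluster_IDs = z)
}
```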
I test fourteen different scenarios that vary parameters in this model. I define the component parameters by \(\Delta_{\mu}\), the distance between the possible means in each feature, and \(\sigma^2\), a common variance across all components and features. A more detailed description of each scenario and its sub-scenarios is given in the table below.
| Scenario | Description | N | P_s | P_n | K | Delta_mu | sigma2 | Pi |
|---|---|---|---|---|---|---|---|---|
| 1 | Simple 2D | 100 | 2 | 0 | 5 | 3.0 | 1 | vec(1/K) |
| 2 | No structure | 100 | 0 | 2 | 1 | 0.0 | 1 | vec(1/K) |
| 3 | Base Case | 200 | 20 | 0 | 5 | 1.0 | 1 | vec(1/K) |
| 4 | Large standard deviation | 200 | 20 | 0 | 5 | 1.0 | 3 | vec(1/K) |
| 5 | Large standard deviation | 200 | 20 | 0 | 5 | 1.0 | 5 | vec(1/K) |
| 6 | Irrelevant features | 200 | 20 | 10 | 5 | 1.0 | 1 | vec(1/K) |
| 7 | Irrelevant features | 200 | 20 | 20 | 5 | 1.0 | 1 | vec(1/K) |
| 8 | Irrelevant features | 200 | 20 | 100 | 5 | 1.0 | 1 | vec(1/K) |
| 9 | Varying proportions | 200 | 20 | 0 | 5 | 1.0 | 1 | (0.5, 0.25, 0.125, 0.0625, 0.0625) |
| 10 | Large N, small P | 10000 | 4 | 0 | 5 | 1.0 | 1 | vec(1/K) |
| 11 | Large N, small P | 10000 | 4 | 0 | 50 | 1.0 | 1 | vec(1/K) |
| 12 | Large N, small P | 10000 | 4 | 0 | 50 | 0.5 | 1 | vec(1/K) |
| 13 | Small N, large P | 50 | 500 | 0 | 5 | 1.0 | 1 | vec(1/K) |
| 14 | Small N, large P | 50 | 500 | 0 | 5 | 0.2 | 1 | vec(1/K) |
In this study we will consider both within-simulation and across-simulation performance for the \(N_{sim}\) simulations run under each scenario.
The aim of each model is to uncover the true clustering of the data, \(C'\). The performance of each model may be judged by comparing the predicted clustering, \(C^*\), to \(C'\) using the adjusted Rand index, \(ARI(C^*, C')\).
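As a reference point, the ARI can be written out directly. Below is a minimal, dependency-free version for hard clusterings under the standard Hubert-Arabie definition (the simulations themselves use `mcclust::arandi`, which should agree on such inputs):

```r
# Minimal adjusted Rand index for two hard clusterings a and b
ari <- function(a, b) {
  tab <- table(a, b)                       # contingency table of the two partitions
  comb2 <- function(x) x * (x - 1) / 2     # "choose 2", vectorised
  sum_ij <- sum(comb2(tab))                # agreements within cells
  sum_a <- sum(comb2(rowSums(tab)))        # pairs together under a
  sum_b <- sum(comb2(colSums(tab)))        # pairs together under b
  expected <- sum_a * sum_b / comb2(length(a))
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

ari(c(1, 1, 2, 2), c(2, 2, 1, 1))  # identical partitions up to relabelling: 1
```

The index is invariant to label permutations, which is why it is suitable for comparing clusterings whose labels carry no meaning.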
Robustness may be judged by the uncertainty on this estimate, but also by across-model agreement (for Bayesian and frequentist inference, though not consensus inference). This needs thought: within a simulation, do the Bayesian / frequentist models disagree on the predicted clustering across model starts, or agree on the clustering but report different uncertainty? Bayesian convergence can be measured using the Gelman-Rubin statistic (across chains) and autocorrelation (within each chain).
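A minimal sketch of those diagnostics with `coda`, using simulated chains as a stand-in for a traced scalar model quantity (e.g. the log-likelihood of each MCMC sample):

```r
library(coda)

# Four chains of 1000 draws each; here iid noise stands in for real traces
set.seed(1)
chains <- mcmc.list(lapply(1:4, function(i) mcmc(rnorm(1000))))

gelman.diag(chains)    # potential scale reduction factor; values near 1 suggest convergence
autocorr.diag(chains)  # within-chain autocorrelation at increasing lags
```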
Time may be measured in seconds, or as relative time within each simulation: e.g. take the fastest frequentist run as baseline, \(t_{0l} = \min_m(t_{\text{freq},ml})\), and report \(\Delta t_{ml} = t_{ml} - t_{0l}\) for the \(l\)th simulation, summarised in a boxplot.
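In R this relative-time summary for a single simulation amounts to (timings below are made up for illustration):

```r
# Made-up timings in seconds for one simulation; names are illustrative
times <- c(freq_1 = 2.1, freq_2 = 1.8, bayes_1 = 40.2, consensus = 9.5)

# Baseline t_0l: the fastest frequentist run in this simulation
t0 <- min(times[grepl("^freq", names(times))])

# Relative times Delta t_ml, to be collected across simulations for a boxplot
delta_t <- times - t0
```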
Below is some pseudo-code describing my current plan for the simulations.
Question: do I need to save the datasets? If I set a seed they can be re-created, and I'm not sure why we would need them otherwise.
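On the seed question: with a fixed seed (and the same generator and R version) the data are exactly reproducible, so storing the seed is enough to re-create each dataset:

```r
# The same seed yields bit-identical draws, so datasets need not be saved
set.seed(42)
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)  # TRUE
```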
# Inputs:
# Scenarios: Scenarios defining datasets within each simulation
# nSim: The number of simulations within each scenario
# nChains: The number of chains used for CI
# nIter: The number of iterations within each chain
# nBayes: The number of chains run for the Bayesian method
# bayesIter: The number of iterations to run for each Bayesian model chain
# nFreq: The number of different seeds used for the Frequentist method
runSimulations(Scenarios, nSim, nChains, nIter, nBayes, bayesIter, nFreq){
# Loop over scenarios
for scn in Scenarios{
# The various parameters that define the current scenario
(N, P, K, delta_mu, sigma2, pi, P_n) = scn$Parameters
# Objects to hold model performance score
ciScore = array(0, dim = c(nIter, nChains, nSim))
bayesScore = matrix(0, nrow = nBayes, ncol = nSim)
freqScore = matrix(0, nrow = nFreq, ncol = nSim)
# Matrices to hold the upper and lower bounds on the number of clusters
# present based upon samples recorded
nClustLower = array(0, dim = c(nIter, nChains, nSim))
nClustUpper = array(0, dim = c(nIter, nChains, nSim))
bayesNClustLower = matrix(0, nrow = nBayes, ncol = nSim)
bayesNClustUpper = matrix(0, nrow = nBayes, ncol = nSim)
# Loop over the number of simulations
for l in 1:nSim{
# Generate a new dataset for the current simulation based upon the scenario
set.seed(l)
dataset, trueClusters = generateSimulationDataset(N, P, K, delta_mu, sigma2, pi, P_n)
# Create arrays to hold the sampled and predicted clusterings
ciClusterings = array(0, dim = c(nIter, N, nChains))
predClustering = array(0, dim = c(nIter, N, nChains))
bayesClustering = matrix(0, nrow = nBayes, ncol = N)
freqClustering = matrix(0, nrow = nFreq, ncol = N)
# Array to hold consensus matrices for different number of iterations
consensusMatrices = array(0, dim = c(N, N, nIter))
# List to hold the results of each Bayes model
bayesModel = list()
# Loop over the number of chains
for i in 1:nChains{
# set the seed to differentiate chains
set.seed(i)
# run nBayes Bayesian models
if(i <= nBayes){
bayesModel[[i]] = bayesianMixtureModel(dataset, bayesIter)
ciModel = bayesModel[[i]] # CI re-uses this chain, truncated to its first nIter samples
# Create the PSM
posteriorSimilarityMatrix = makePSM(bayesModel[[i]]$clusterings)
# Create a summary clustering
bayesClustering[i, ] = clustering(posteriorSimilarityMatrix)
# Uncertainty on the number of clusters present based upon MCMC samples
# (see Sara Wade and Zoubin Ghahramani, 2015).
bayesCredibleBall = credibleball(
bayesClustering[i, ],
bayesModel[[i]]$clusterings
)
# Record the score of the current Bayesian model
bayesScore[i, l] = ARI(bayesClustering[i, ], trueClusters)
bayesPredK = numberClusters(bayesClustering[i, ])
# Record the uncertainty on the number of clusters present
bayesNClustLower[i, l] = bayesCredibleBall$lower - bayesPredK
bayesNClustUpper[i, l] = bayesCredibleBall$upper - bayesPredK
} else {
ciModel = bayesianMixtureModel(dataset, nIter)
}
# Initialise the consensus matrix
consensusMatrix = matrix(0, nrow = N, ncol = N)
for j in 1:nIter{
# The clusterings for the current chain and given
ciClusterings[j, , i] = ciModel$clusterings[j,]
# Build a consensus matrix for the current results
currConsensusMatrix = makeConsensusMatrix(ciModel$clusterings[j,])
# Update the consensus matrix pertaining to the number of chains
consensusMatrices[, , j] = currConsensusMatrix * (1 / i) + consensusMatrices[, , j] * ((i - 1) / i)
# Need some way of doing this - Paul doesn't agree with use of
# ``mcclust::maxpear``, I have to think about this
predClustering[j, , i] = clustering(consensusMatrices[, , j])
# calculate how well the model has performed in uncovering the true
# structure
ciScore[j, i, l] = ARI(predClustering[j, , i], trueClusters)
# The predicted number of clusters present
ciPredK = numberClusters(predClustering[j, , i])
# Not sure that this is appropriate either for the same reasons as
# ``mcclust::maxpear`` might not be; I have to look into this properly.
credibleBallClusters = credibleball(
predClustering[j, , i],
ciClusterings[j, ,]
)
# Record the lower and upper bounds on the number of clusters present
nClustLower[j, i, l] = credibleBallClusters$lower - ciPredK
nClustUpper[j, i, l] = credibleBallClusters$upper - ciPredK
}
# run nFreq frequentist models and record the predicted clustering
if(i <= nFreq){
freqModel = frequentistMixtureModel(dataset)
freqClustering[i, ] = freqModel$clustering
freqScore[i, l] = ARI(freqModel$clustering, trueClusters)
}
}
# Median CI score across chains for each number of iterations
medCIScore = apply(ciScore[, , l], 1, median)
# Check if the different Bayesian models are converged (some test based
# upon Gelman-Rubin and auto-correlation?)
# see: coda::gelman.diag(), coda::autocorr.diag()
bayesConverged = checkConvergence(bayesModel)
}
# Make a 3D surface plot of the median score of the consensus inference
# results as a function of (nIter, nChains)
plot3D(apply(ciScore, c(1, 2), median))
# Make line plots of the median score bounded by the interquartile range over
# simulations for a subset of number of chains as a function of number of
# iterations and for a subset of number of iterations as a function of
# number of chains.
subsetChains = c(1, 10, 50, 100)
subsetIter = c(1, 100, 500, 5000)
nChainSubsets = length(subsetChains)
nIterSubsets = length(subsetIter)
facetLinePlot(ciScore, ~nChains:nIter, subsetChains, subsetIter)
# The scores for the subset of interest
ciSubsetScores = ciScore[subsetIter, subsetChains, ]
###Not sure about comparing yet -
###Also, add test (coda::gelman.diag()) for convergence in Bayesian case
# # Compare the median score of the models over simulations:
# for(i in 1:nChainSubsets){
# medCIScoreChain = ciSubsetScores[ , i, ] %>%
# apply(1, median)
# }
#
# for(j in 1:nIterSubsets){
# medCiScoreIter = ciSubsetScores[j, , ] %>%
# apply(1, median)
# }
#
# medBayesScore = bayesScore %>%
# apply(1, median)
#
# medFreqScore = freqScore %>%
# apply(1, median)
#
# # Compare median score over simulations
# boxplot(medCIScoreChain, medCiScoreIter, medBayesScore, medFreqScore)
}
}
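The incremental consensus-matrix update in the pseudo-code above is just a running mean over chains. A quick, self-contained check in R that the update rule matches averaging the coclustering matrices directly (toy clusterings of four items):

```r
# Toy coclustering matrices for three chains of a 4-item clustering
set.seed(1)
n <- 4
cms <- lapply(1:3, function(i) {
  cl <- sample(1:2, n, replace = TRUE)
  outer(cl, cl, "==") * 1  # 1 if two items share a cluster, else 0
})

# The running-mean update used in the pseudo-code
running <- matrix(0, n, n)
for (i in seq_along(cms)) {
  running <- cms[[i]] * (1 / i) + running * ((i - 1) / i)
}

all.equal(running, Reduce(`+`, cms) / length(cms))  # TRUE
```

This is why the update can be applied chain by chain without storing all the individual coclustering matrices.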
My idea.
K <- 5
n <- 100
n_sim <- 10
n_seeds <- 50
n_iter <- 500
score_ci <- array(dim = c(n_sim, n_seeds, n_iter))
true_labels <- matrix(sample(1:K, size = n_sim * n, replace = T), nrow = n_sim)
for(l in 1:n_sim){
true_labels_l <-true_labels[l, ]
ci_out <- list()
ci_cm <- list()
pred_ci <- matrix(nrow = n_iter, ncol = n)
score_ci_l <- matrix(nrow = n_iter, ncol = n_seeds)
for(j in 1:n_iter){
.old_cm <- matrix(0, nrow=n, ncol=n)
ci_out_j <- matrix(0, nrow = n_seeds, ncol = n)
for(i in 1:n_seeds){
set.seed(i)
ci_out_j[i, ] <- ciSim(true_labels_l, j, K, truth_at = 1500)
# Create coclustering matrix (technically cltoSim creates a posterior similarity
# matrix from a clustering vector, but the effect in *this* case is the same)
.curr_cm_i <- mcclust::cltoSim(ci_out_j[i, ])
# .curr_cm_i <- makePSM(ci_out_j[i, ])
.curr_cm <- .curr_cm_i * (1/i) + .old_cm * ((i - 1) / i)
score_ci[l, i, j] <- .curr_cm %>%
mcclust::maxpear() %>%
.$cl %>%
mcclust::arandi(true_labels_l)
.old_cm <- .curr_cm
}
ci_out[[j]] <- ci_out_j
ci_cm[[j]] <- ci_cm_j <- .curr_cm # makePSM(ci_out_j)
pred_ci[j,] <- ci_cl_j <- mcclust::maxpear(ci_cm_j)$cl
# score_ci[j] <- mcclust::arandi(ci_cl_j, true_labels)
}
# score_ci[[l]] <- score_ci_l
}
Visualisations of results.
probs <- seq(0., 1, 0.25)
for(i in 1:n_iter){
if(i == 1){
surfaces <- apply(score_ci[, , i], 2, quantile, probs = probs)
} else {
surfaces <- cbind(surfaces, apply(score_ci[, , i], 2, quantile, probs = probs))
}
}
surfaces[,1:4]
## [,1] [,2] [,3] [,4]
## 0% 0.02079530 0.02092104 0.01332191 0.02334382
## 25% 0.03887161 0.03685330 0.02236486 0.04215909
## 50% 0.04207585 0.04565432 0.03189869 0.04843739
## 75% 0.06738553 0.06763436 0.04727810 0.05226161
## 100% 0.09367318 0.09367318 0.07071668 0.13768466
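Aside: the `cbind` loop above can be collapsed into a single `apply` over the (seed, iteration) margins. A self-contained check with a small toy array standing in for `score_ci`:

```r
# Toy score array standing in for score_ci (dims: n_sim x n_seeds x n_iter)
set.seed(1)
sc <- array(runif(2 * 3 * 4), dim = c(2, 3, 4))
probs <- seq(0, 1, 0.25)

# Loop version, as above
loop_res <- NULL
for (i in 1:4) {
  loop_res <- cbind(loop_res, apply(sc[, , i], 2, quantile, probs = probs))
}

# Vectorised version: quantiles over every (seed, iteration) slice at once,
# then flattened column-major, which reproduces the cbind layout
vec_res <- matrix(apply(sc, c(2, 3), quantile, probs = probs),
                  nrow = length(probs))

all.equal(unname(loop_res), unname(vec_res))  # TRUE
```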
z1 <- surfaces[1, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z2 <- surfaces[2, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z3 <- surfaces[3, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z4 <- surfaces[4, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z5 <- surfaces[5, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
col_pal <- colorRampPalette(c("#146EB4", "white", "#FF9900"))(100)
plot_ly(colors = col_pal) %>%
add_surface(x = ~1:n_seeds, y = ~1:n_iter, z = ~z2, coloraxis = 'coloraxis', opacity = 0.6) %>%
add_surface(x = ~1:n_seeds, y = ~1:n_iter, z = ~z3, coloraxis = 'coloraxis') %>%
add_surface(x = ~1:n_seeds, y = ~1:n_iter, z = ~z4, coloraxis = 'coloraxis', opacity = 0.6) %>%
layout(
scene = list(
xaxis = list(title = "Number of chains"),
yaxis = list(title = "Number of iterations"),
zaxis = list(title = "ARI predicted clustering to truth")
),
coloraxis= list(colorscale='Viridis') #,
# surfacecolor = list(col_pal),
) %>%
colorbar(title = "ARI")
Facet line plot
plt_z2 <- surface2Df(z2, n_seeds, n_iter)
plt_z3 <- surface2Df(z3, n_seeds, n_iter)
plt_z4 <- surface2Df(z4, n_seeds, n_iter)
plt_data <- dplyr::bind_cols(
plt_z3,
plt_z2 %>% dplyr::select(ARI),
plt_z4 %>% dplyr::select(ARI)
) %>%
set_colnames(c("N_iter", "N_chains", "ARI", "ARI_lb", "ARI_ub"))
chains_used <- c(1, 5, 15, 30)
iter_used <- c(1, 10, 50, 100)
# New facet label names for cluster variable
chain_labels <- c(paste0("Number of chains: ", chains_used))
names(chain_labels) <- chains_used
# Create the plot
p1 <- ggplot(
plt_data %>% filter(N_chains %in% chains_used),
aes(x = as.numeric(N_iter), y = ARI, group = N_chains)
) +
geom_line() +
geom_ribbon(aes(ymin = ARI_lb, ymax = ARI_ub, group = N_chains), colour = NA, alpha = 0.3) +
facet_wrap(~N_chains, ncol = 1, labeller = labeller(N_chains = chain_labels)) +
labs(
x = "Number of iterations" #,
# subtitle = "Number of chains",
# subtitle = "Median score and inter-quartile range of predictions against truth",
# title = "Performance of Consensus inference across simulations"
)
# New facet label names for cluster variable
iter_labels <- c(paste0("Number of iterations: ", iter_used))
names(iter_labels) <- iter_used
p2 <- ggplot(
plt_data %>% filter(N_iter %in% iter_used),
aes(x = as.numeric(N_chains), y = ARI, group = N_iter)
) +
geom_line() +
geom_ribbon(aes(ymin = ARI_lb, ymax = ARI_ub, group = N_iter), colour = NA, alpha = 0.3) +
facet_wrap(~N_iter, ncol = 1, labeller = labeller(N_iter = iter_labels)) +
labs(
x = "Number of chains" # ,
# subtitle = "Number of iterations"
)
p1 + p2 +
plot_annotation(
title = 'Performance of Consensus inference across simulations',
subtitle = "Median score and inter-quartile range of ARI (predictions against truth)"
)
I was thinking (for the real data) of plotting PCA series (is there a nice term for a series over an ordered variable?). Example:
# Parameters defining generated data
K <- 5
n <- 100
p <- 20
# Generate a dataset
my_data <- generateSimulationDataset(K = K, p = p, n = n)
# PCA
pc1 <- prcomp(my_data$data)
# Include a ribbon about the inter-quantile range
p3 <- suppressWarnings(pcaSeriesPlot(pc1$x,
labels = my_data$cluster_IDs,
include_area = T,
n_comp = 6
))
# New facet label names for cluster variable
cluster_labs <- c(paste0("Cluster ", 1:5))
names(cluster_labs) <- 1:5
# Create the plot
p3 +
facet_wrap(~Cluster, labeller = labeller(Cluster = cluster_labs)) +
labs(title = "PCA plot including medians and inter-quartile range") +
theme(legend.position="none")
## Warning: Removed 1400 rows containing missing values (geom_point).
## Warning: Removed 1400 row(s) containing missing values (geom_path).
## Warning: Removed 70 row(s) containing missing values (geom_path).
This is a way of visualising the different clusters and the signatures that define them. I think visualising the medians this way makes it easier to see that all of the clusters separate out sensibly, even if plotting them as a series across components might be misleading.
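The summaries behind such a plot are just the per-cluster median and inter-quartile range of the scores on each component. A self-contained sketch with stand-in data (the real plot uses `pc1$x` and `my_data$cluster_IDs`):

```r
# Stand-in PCA scores and cluster labels, in place of pc1$x / cluster_IDs
set.seed(1)
scores <- matrix(rnorm(100 * 6), ncol = 6,
                 dimnames = list(NULL, paste0("PC", 1:6)))
clusters <- sample(1:5, 100, replace = TRUE)

# One row per cluster, one column per component
cluster_medians <- apply(scores, 2, tapply, clusters, median)
cluster_lower   <- apply(scores, 2, tapply, clusters, quantile, probs = 0.25)
cluster_upper   <- apply(scores, 2, tapply, clusters, quantile, probs = 0.75)
```

Plotting `cluster_medians` as a line over the ordered components, ribboned by `cluster_lower` and `cluster_upper`, gives one "series" per cluster.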
Thinking about possible extensions (not for now):
Also, would combining 2. and 3. go some way to making this a biclustering method? As some items would cluster together under some features but not others, we would have some information for clustering both items and features.
Law, Martin H, Anil K Jain, and Mário Figueiredo. 2003. “Feature Selection in Mixture-Based Clustering.” In Advances in Neural Information Processing Systems, 641–48.